Rich Statistical Parsing and Literary Language

Author

  • Andreas Wolf van Cranenburgh
Abstract

This thesis applies the Data-Oriented Parsing framework in two areas: parsing and literature. The data-oriented approach rests on the assumption that re-use of chunks of training data can be detected and exploited at test time. Syntactic tree fragments form the common thread in the thesis. Chapter 2 presents a method to efficiently extract them from treebanks, based on heuristics of re-occurrence. This method is thus able to discover the potential building blocks of large corpora. Chapter 3 then develops a multi-lingual statistical parser based on tree-substitution grammar that handles discontinuous constituents and function tags. We show how a mildly context-sensitive grammar can be employed to produce discontinuous constituents, and then compare this to an approximation that stays within the efficiently parsable context-free framework. The conclusion from the empirical evaluation is that tree fragments allow the grammar to adequately capture the statistical regularities of non-local relations, without the need for the increased generative capacity of mildly context-sensitive grammar.

The second part investigates what separates literary from other novels. Aside from an introduction to machine learning, Chapter 4 discusses the difference between explanation and prediction.

Chapter 5 discusses the data used for this investigation. We work with a corpus of novels and a reader survey with ratings of how literary the novels are perceived to be. While considerable questions remain with respect to whether a survey of the general public is an appropriate instrument to probe the concept of literature, when viewed as a barometer of public opinion we may consider the basic question of whether such opinions are at all predictable. The first goal is therefore to find out the extent to which the literary ratings can be predicted from the texts; the second, more challenging goal is to characterize the kind of patterns that are predictors of more or less literary texts.

Chapter 6 establishes baselines for this question. We show that literary novels contain fewer adjectives and adverbs than non-literary novels, and present several simple measures that are significantly correlated with the literary ratings, such as vocabulary richness and text compressibility. Cliché expressions are established as a negative marker of literary language. A topic model is developed of the ...


Similar resources

The Effect of Morphology on Persian Dependency Parsing

Data-driven systems can be adapted to different languages and domains easily. Applying this trend to dependency parsing has led to the introduction of data-driven approaches. The existence of appropriate corpora containing sentences and their associated dependency trees is the only prerequisite for data-driven approaches. Despite obtaining highly accurate results for the dependency parsing task in English langu...


An improved joint model: POS tagging and dependency parsing

Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of natural language sentences, producing a dependency graph for each input sentence. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...


Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing

This paper makes two contributions to dependency parsing in Korean. First, we build a Korean dependency treebank from an existing constituent treebank. For a morphologically rich language like Korean, dependency parsing shows some advantages over constituent parsing. Since there is not much training data available, we automatically generate dependency trees by applying head-percolation rules an...


Learning Semantic Parsers for Natural Language Understanding

For building question answering systems and natural language interfaces, semantic parsing has emerged as an important and powerful paradigm. Semantic parsers map natural language into logical forms, the classic representation for many important linguistic phenomena. The modern twist is that we are interested in learning semantic parsers from data, which introduces a new layer of statistical and...


The AI-KU System at the SPMRL 2013 Shared Task : Unsupervised Features for Dependency Parsing

We propose the use of the word categories and embeddings induced from raw text as auxiliary features in dependency parsing. To induce word features, we make use of contextual, morphologic and orthographic properties of the words. To exploit the contextual information, we make use of substitute words, the most likely substitutes for target words, generated by using a statistical language model. ...



Publication date: 2016